Trio-ER: The Trio System as a Workbench for Entity-Resolution

Authors

  • Parag Agrawal
  • Robert Ikeda
  • Hyunjung Park
  • Jennifer Widom
Abstract

Entity-resolution (also known as deduplication, record linkage, and reference reconciliation, among others) was one of the original motivating applications [6] for the Trio system, which has been under development at Stanford over the past several years:

  • Entity-resolution is the process of determining when multiple data records are likely to represent the same real-world entity, and possibly merging such records [5].
  • Trio is a research prototype DBMS for managing data that includes uncertainty, and for tracking lineage automatically as queries are performed [2].

Uncertainty is an important component of entity-resolution: match functions may return confidence values, and merged records may retain alternative possible values for attributes. Lineage is also important, since it tracks the original records contributing to a merged result, which is useful for debugging and for entity-resolution algorithms that may “unmerge” records. In Trio, lineage also enables computing confidence values for merged records, and it supports useful processing steps in iterative entity-resolution (Section IV-E).

Trio-ER is a new variant of the Trio system tailored as a workbench for entity-resolution. The goal of the first version of Trio-ER has been to enable rapid prototyping of an important basic class of entity-resolution algorithms using Trio.

As an example of rapid prototyping in Trio-ER, consider a table Hotel(URL,name,zip) containing hotel information. (One of the real data sets we have been using is hotel information from Yahoo! Travel.) Assume we have a string comparison function StrComp(s1,s2) that returns a similarity value in [0, 1]. Suppose we believe two hotel records represent the same hotel if their zip is identical and they have more than a 0.95 string-match on either URL or name. When records are merged, all variants of URL and name are retained. The following query, expressed in Trio’s SQL extension called TriQL [1], performs one iteration of pairwise matching and merging:
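
The TriQL query itself is not reproduced in this excerpt. As a rough illustration of the matching rule described above (identical zip, and a similarity above 0.95 on either URL or name), here is a minimal Python sketch. It is not TriQL and not the authors' implementation: `str_comp` is a hypothetical stand-in for StrComp built on Python's `difflib`, the sample hotel rows are invented, and the `lineage` field only loosely imitates Trio's lineage tracking by recording which input rows produced each merged record.

```python
from difflib import SequenceMatcher
from itertools import combinations


def str_comp(s1, s2):
    """Hypothetical stand-in for StrComp: a similarity value in [0, 1]."""
    return SequenceMatcher(None, s1, s2).ratio()


def merge_pairs(hotels, threshold=0.95):
    """One iteration of pairwise matching and merging over Hotel rows.

    Two rows match if their zip is identical and either their URLs or
    their names have a similarity strictly above `threshold`.
    """
    merged = []
    for (i, a), (j, b) in combinations(enumerate(hotels), 2):
        if a["zip"] != b["zip"]:
            continue
        url_match = str_comp(a["url"], b["url"]) > threshold
        name_match = str_comp(a["name"], b["name"]) > threshold
        if url_match or name_match:
            merged.append({
                # All variants of URL and name are retained.
                "urls": {a["url"], b["url"]},
                "names": {a["name"], b["name"]},
                "zip": a["zip"],
                # Indices of the contributing input rows (assumed lineage form).
                "lineage": (i, j),
            })
    return merged


# Invented sample data, for illustration only.
hotels = [
    {"url": "http://bayview.example/", "name": "Bayview Inn", "zip": "94102"},
    {"url": "http://bayview.example/", "name": "Bayview Inn & Suites", "zip": "94102"},
    {"url": "http://harbor.example/", "name": "Harbor Lodge", "zip": "94108"},
    {"url": "http://bayview.example/", "name": "Bayview Inn", "zip": "10001"},
]

merged = merge_pairs(hotels)
```

In this sketch, rows 0 and 1 are merged because they share a zip and have identical URLs, while row 3 is left alone despite its identical URL and name because its zip differs. Unlike Trio, the sketch attaches no confidence values to the merged records.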





Publication date: 2009